For this project, we examine the stock prices of three major tech companies, Apple, Facebook, and Google, from 2012 to 2018. We obtained the dates, opening S&P 500 indices, and daily closing prices from Yahoo Finance.
Given the nature of the stock market, temporal structure can be expected when analyzing and predicting stock prices. We took the closing price as the response variable and fit models with the S&P 500 index as the mean structure, both to predict each company's daily closing price and to understand the remaining temporal structure.
Since all three companies share the same timeline, with observations at the same time points, we fit a separate model for each company. We first present some Exploratory Data Analysis for all three companies, then fit simpler models with no temporal structure, and finally fit and evaluate temporal models for each company. For the temporal models specifically, we try two approaches: automatically fitted ARIMA models and Gaussian process models. We then compare the performance of all three types of models.
Lastly, we wish to answer our questions of interest: can we predict the closing price of a stock from the S&P 500 index alone? Is there any temporal dependency in the closing prices? Do the companies differ, or do they share similar trends and structures? Can such a model, potentially with its temporal structure, be generalized to other companies? We also discuss the adequacy of and potential problems with the models, and offer suggestions for developing this project further.
summary(df)
## Date Open fb_close google_close
## Min. :2012-05-18 Min. :1278 Min. : 17.73 Min. : 277.7
## 1st Qu.:2013-11-11 1st Qu.:1765 1st Qu.: 49.50 1st Qu.: 504.1
## Median :2015-05-06 Median :2033 Median : 81.73 Median : 581.6
## Mean :2015-05-06 Mean :2003 Mean : 90.37 Mean : 641.2
## 3rd Qu.:2016-10-25 3rd Qu.:2178 3rd Qu.:124.94 3rd Qu.: 779.6
## Max. :2018-04-20 Max. :2867 Max. :193.09 Max. :1175.8
## apple_close
## Min. : 55.79
## 1st Qu.: 81.64
## Median :105.80
## Mean :108.11
## 3rd Qu.:126.81
## Max. :181.72
#time trend
p1 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = fb_close)) +
geom_line() +
ggtitle("Stock Close Price of Facebook")
p2 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = apple_close)) +
geom_line() +
ggtitle("Stock Close Price of Apple")
p3 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = google_close)) +
geom_line() +
ggtitle("Stock Close Price of Google")
p4 = ggplot(data = df %>% arrange(Date), aes(x = Date, y = Open)) +
geom_line() +
ggtitle("S&P 500 Index Open")
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
#scatter plot
p1 = ggplot(data = df, aes(x = Open, y = fb_close)) +
geom_point(size = 0.1) +
ggtitle("Close~Open for Facebook")
p2 = ggplot(data = df, aes(x = Open, y = apple_close)) +
geom_point(size = 0.1) +
ggtitle("Close~Open for Apple")
p3 = ggplot(data = df, aes(x = Open, y = google_close)) +
geom_point(size = 0.1) +
ggtitle("Close~Open for Google")
grid.arrange(p1, p2, p3, nrow = 2, ncol = 2)
We focus on three variables: Date, Open (the opening S&P 500 index), and the closing stock price of each of the three companies.
The line plots of the three companies' closing prices against date, as well as the S&P 500 index against date, all show an increasing, roughly linear trend over time. This is expected, as the stock market was steadily rising and these tech companies were developing rapidly during the period we study. Since the S&P 500 index shares this linear trend, using it as a predictor/mean structure de-trends the closing prices, letting us examine any remaining temporal structure.
We then look at scatter plots of each company's closing price against the opening S&P 500 index. Although Apple shows some discrepancy, the relationship is more or less linear, so the scatter plots further support using the S&P 500 index as the mean structure.
The first method fits a simple linear model for each company, using the S&P 500 index to predict the daily closing price.
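The `rmse()` helper called throughout is assumed to come from a package such as `Metrics`; if it is not loaded, a minimal equivalent is:

```r
# Root mean squared error between observed and predicted values;
# a minimal stand-in for Metrics::rmse(), assuming that package isn't loaded.
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), approximately 1.1547
```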
#Apple
apple_lm = lm(apple_close ~ Open, data = df)
summary(apple_lm)
##
## Call:
## lm(formula = apple_close ~ Open, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.504 -12.250 -0.823 10.708 35.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49.549667 2.066132 -23.98 <2e-16 ***
## Open 0.078712 0.001015 77.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.17 on 1488 degrees of freedom
## Multiple R-squared: 0.8016, Adjusted R-squared: 0.8015
## F-statistic: 6012 on 1 and 1488 DF, p-value: < 2.2e-16
df$apple_naive_pred = predict(apple_lm, newdata = df)
df$apple_naive_residual = df$apple_close - df$apple_naive_pred
ggplot(data = df, aes(x = Date)) +
geom_line(aes(y = apple_close, color = "actual")) +
geom_line(aes(y = apple_naive_pred, color = "fitted")) +
scale_color_manual(name = NULL, values = c(actual = "red", fitted = "blue")) +
xlab("time") +
ylab("Apple Daily Close Stock Price") +
ggtitle("Simple Linear Model")
rmse(df$apple_close, df$apple_naive_pred)
## [1] 14.16018
#Facebook
fb_lm = lm(fb_close ~ Open, data = df)
summary(fb_lm)
##
## Call:
## lm(formula = fb_close ~ Open, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.987 -12.587 0.699 11.056 38.127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.687e+02 2.078e+00 -81.18 <2e-16 ***
## Open 1.293e-01 1.021e-03 126.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.25 on 1488 degrees of freedom
## Multiple R-squared: 0.9152, Adjusted R-squared: 0.9151
## F-statistic: 1.605e+04 on 1 and 1488 DF, p-value: < 2.2e-16
df$fb_naive_pred = predict(fb_lm, newdata = df)
df$fb_naive_residual = df$fb_close - df$fb_naive_pred
ggplot(data = df, aes(x = Date)) +
geom_line(aes(y = fb_close, color = "actual")) +
geom_line(aes(y = fb_naive_pred, color = "fitted")) +
scale_color_manual(name = NULL, values = c(actual = "red", fitted = "blue")) +
xlab("time") +
ylab("Facebook Daily Close Stock Price") +
ggtitle("Simple Linear Model")
rmse(df$fb_close, df$fb_naive_pred)
## [1] 14.23986
#Google
google_lm = lm(google_close ~ Open, data = df)
summary(google_lm)
##
## Call:
## lm(formula = google_close ~ Open, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -182.19 -23.43 14.16 40.36 163.93
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.011e+02 9.740e+00 -51.45 <2e-16 ***
## Open 5.703e-01 4.785e-03 119.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.79 on 1488 degrees of freedom
## Multiple R-squared: 0.9052, Adjusted R-squared: 0.9051
## F-statistic: 1.42e+04 on 1 and 1488 DF, p-value: < 2.2e-16
df$google_naive_pred = predict(google_lm, newdata = df)
df$google_naive_residual = df$google_close - df$google_naive_pred
ggplot(data = df, aes(x = Date)) +
geom_line(aes(y = google_close, color = "actual")) +
geom_line(aes(y = google_naive_pred, color = "fitted")) +
scale_color_manual(name = NULL, values = c(actual = "red", fitted = "blue")) +
xlab("time") +
ylab("Google Daily Close Stock Price") +
ggtitle("Simple Linear Model")
rmse(df$google_close, df$google_naive_pred)
## [1] 66.74942
\[ \begin{aligned} Close_t &= \beta_1 \, Open_t + \eta_t, \qquad \eta_t \sim ARIMA(p, d, q) \times (P, D, Q)_s \end{aligned} \]
#Apple
apple.ts = auto.arima(df %>% select(apple_close), xreg = df$Open, seasonal = TRUE)
apple.ts %>% summary()
## Series: df %>% select(apple_close)
## Regression with ARIMA(2,1,0) errors
##
## Coefficients:
## ar1 ar2 xreg
## 0.0092 -0.0491 0.0053
## s.e. 0.0303 0.0262 0.0033
##
## sigma^2 estimated as 2.676: log likelihood=-2844.26
## AIC=5696.52 AICc=5696.54 BIC=5717.74
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.05806906 1.633794 1.149067 0.03724077 1.097862 0.9989548
## ACF1
## Training set -0.001345197
ggtsdisplay(apple.ts$residuals, main = "ARIMA(2,1,0)")
# Residual plots look good, so we keep this model.
df$apple_ts_pred = c(apple.ts$fitted)
ggplot(data = df, aes(x = Date)) +
geom_line(aes(y = apple_close, color = "actual")) +
geom_line(aes(y = apple_ts_pred, color = "fitted")) +
scale_color_manual(name = NULL, values = c(actual = "red", fitted = "blue")) +
xlab("time") +
ylab("Apple Daily Close Stock Price") +
ggtitle("ARIMA(2,1,0)")
rmse(df$apple_close, df$apple_ts_pred)
## [1] 1.633794
#Facebook
facebook.ts = auto.arima(df %>% select(fb_close), xreg = df$Open, seasonal = TRUE)
facebook.ts %>% summary()
## Series: df %>% select(fb_close)
## Regression with ARIMA(2,1,2) errors
##
## Coefficients:
## ar1 ar2 ma1 ma2 drift xreg
## -0.0576 0.8242 0.0539 -0.8907 0.0860 0.0021
## s.e. 0.0700 0.0682 0.0574 0.0571 0.0306 0.0031
##
## sigma^2 estimated as 2.82: log likelihood=-2881.82
## AIC=5777.64 AICc=5777.72 BIC=5814.78
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.002329128 1.675428 1.120526 -0.1057361 1.517573 0.9969723
## ACF1
## Training set -0.0004667567
ggtsdisplay(facebook.ts$residuals, main = "ARIMA(2,1,2)")
# The residual ACF shows spikes around lag 30, so we test whether adding a seasonal AR or MA term improves the model.
facebook.try1 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 0, 1),period = 30))
ggtsdisplay(facebook.try1$residuals)
facebook.try2 = Arima(df %>% select(fb_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 0, 0),period = 30))
ggtsdisplay(facebook.try2$residuals)
# However, judging from the residual plots, adding seasonal terms doesn't improve performance. Moreover, the autocorrelation at lag 30 is only about 0.1, which is relatively small, so we keep the auto.arima result.
df$fb_ts_pred = c(facebook.ts$fitted)
ggplot(data = df, aes(x = Date)) +
geom_line(aes(y = fb_close, color = "actual")) +
geom_line(aes(y = fb_ts_pred, color = "fitted")) +
scale_color_manual(name = NULL, values = c(actual = "red", fitted = "blue")) +
xlab("time") +
ylab("Facebook Daily Close Stock Price") +
ggtitle("ARIMA(2,1,2)")
rmse(df$fb_close, df$fb_ts_pred)
## [1] 1.675428
#Google
google.ts = auto.arima(df %>% select(google_close), xreg = df$Open, seasonal = TRUE)
google.ts %>% summary()
## Series: df %>% select(google_close)
## Regression with ARIMA(2,1,2) errors
##
## Coefficients:
## ar1 ar2 ma1 ma2 drift xreg
## -0.5939 -0.8918 0.5656 0.8422 0.4368 0.0666
## s.e. 0.1005 0.0927 0.1259 0.1076 0.2345 0.0197
##
## sigma^2 estimated as 87.05: log likelihood=-5435.1
## AIC=10884.2 AICc=10884.28 BIC=10921.34
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02346136 9.307922 6.106312 -0.01130659 0.96099 0.9965447
## ACF1
## Training set 0.00876494
ggtsdisplay(google.ts$residuals, main = "ARIMA(2,1,2)")
# The ACF/PACF plots show spikes and a potentially periodic trend, so we try adding seasonal terms to the model.
google.try1 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 0),period = 7))
ggtsdisplay(google.try1$residuals)
google.try2 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(1, 1, 0),period = 7))
ggtsdisplay(google.try2$residuals)
google.try3 = Arima(df %>% select(google_close), xreg = df$Open, order = c(2, 1, 2),seasonal = list(order = c(0, 1, 1),period = 7))
ggtsdisplay(google.try3$residuals)
# However, after trying different combinations of seasonal terms, the ACF/PACF haven't improved much, and the autocorrelations are relatively low, so we keep the auto.arima result.
df$google_ts_pred = c(google.ts$fitted)
ggplot(data = df, aes(x = Date)) +
geom_line(aes(y = google_close, color = "actual")) +
geom_line(aes(y = google_ts_pred, color = "fitted")) +
scale_color_manual(name = NULL, values = c(actual = "red", fitted = "blue")) +
xlab("time") +
ylab("Google Daily Close Stock Price") +
ggtitle("ARIMA(2,1,2)")
rmse(df$google_close, df$google_ts_pred)
## [1] 9.307922
\[ \begin{aligned} Close_t &= X_t \beta + w_t \\ w_t &\sim GP(0, \Sigma) \\ \Sigma_{ij} &= \sigma^2 \exp\!\big( -(l \, d_{ij})^2 \big) + \sigma^2_w \, \mathbb{1}_{\{i = j\}} \quad \text{(squared-exponential covariance with nugget)} \end{aligned} \] Because JAGS takes a long time to run on large datasets, we subset the data to one year, from 2017-04-21 to 2018-04-20.
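The `sq_exp_cov()` and `post_summary()` helpers used in the Gaussian process sections below are not defined in this document; minimal sketches consistent with how they are called (and with the covariance in the JAGS model above) would be:

```r
# Squared-exponential covariance evaluated on a distance matrix d, matching
# the JAGS model: sigma2 * exp(-(l*d)^2), with nugget sigma2_w added where
# d == 0 (i.e. on the diagonal of a square distance matrix).
sq_exp_cov = function(d, sigma2 = 1, l = 1, sigma2_w = 0) {
  sigma2 * exp(-(l * d)^2) + ifelse(d == 0, sigma2_w, 0)
}

# Summarize posterior predictive draws (rows = draws, columns = locations)
# into the posterior mean and a 95% credible interval per location.
post_summary = function(draws, ci_width = 0.95) {
  lower = (1 - ci_width) / 2
  data.frame(
    post_mean  = apply(draws, 2, mean),
    post_lower = apply(draws, 2, quantile, probs = lower),
    post_upper = apply(draws, 2, quantile, probs = 1 - lower)
  )
}
```

These are assumed implementations reconstructed from usage, not the authors' originals; `ifelse()` preserves the matrix shape of `d`, so the nugget lands only on zero-distance entries.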
apple_emp_cloud = subset %>% emp_semivariogram(apple_naive_residual,Date)
apple_emp = rbind(
subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=20) %>% mutate(binwidth="binwidth=20"),
subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=40) %>% mutate(binwidth="binwidth=40"),
subset %>% emp_semivariogram(apple_naive_residual, Date, bin=TRUE, binwidth=30) %>% mutate(binwidth="binwidth=30")
)
apple_emp %>%
ggplot(aes(x=h, y=gamma)) +
geom_point(size = 1) +
ggtitle("Empirical Semivariogram of Apple (binned)")+
facet_wrap(~binwidth, nrow=2)
apple_gp_exp_model = "model{
y ~ dmnorm(mu, inverse(Sigma))
for (i in 1:N) {
mu[i] <- beta[1]+ beta[2] * x[i]
}
for (i in 1:(N-1)) {
for (j in (i+1):N) {
Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
Sigma[j,i] <- Sigma[i,j]
}
}
for (k in 1:N) {
Sigma[k,k] <- sigma2 + sigma2_w
}
for (i in 1:2) {
beta[i] ~ dt(coef[i], 2.5, 1)
}
sigma2_w ~ dnorm(10, 1/25) T(0,)
sigma2 ~ dnorm(50, 1/25) T(0,)
l ~ dt(0,2.5,1) T(0,)
}"
if (file.exists("apple_gp_jags.Rdata")) {
load(file="apple_gp_jags.Rdata")
} else {
m = rjags::jags.model(
textConnection(apple_gp_exp_model),
data = list(
y = subset$apple_close,
x = subset$Open,
d = dist(subset$Date) %>% as.matrix(),
N = nrow(subset),
coef = coef(apple_lm)
),
quiet = TRUE
)
update(m, n.iter=2000)
exp_cov_coda = rjags::coda.samples(
m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
n.iter=2000, thin=10
)
save(exp_cov_coda, file="apple_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
ungroup() %>%
mutate(.variable = paste0(.variable, "[",i,"]")) %>%
select(-i)
betas %>%
group_by(.variable) %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.iteration, y=.value, color=.variable)) +
geom_line() +
facet_grid(.variable~., scales = "free_y")
params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.iteration, y=.value, color=.variable)) +
geom_line() +
facet_grid(.variable~., scales="free_y")
params %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.value, fill=.variable)) +
geom_density() +
facet_wrap(~.variable, scales="free") +
guides(fill=FALSE)
params %>%
slice(seq(1,n(),length.out=500)) %>%
filter(.variable == "l") %>%
ggplot(aes(x=.value, fill=.variable)) +
geom_density() +
scale_x_log10() +
facet_wrap(~.variable, scales="free") +
guides(fill=FALSE)
post = bind_rows(betas, params) %>%
group_by(.variable) %>%
summarize(
post_mean = mean(.value),
post_med = median(.value),
post_lower = quantile(.value, probs = 0.025),
post_upper = quantile(.value, probs = 0.975)
)
knitr::kable(post, digits = 5)
| .variable | post_mean | post_med | post_lower | post_upper |
|---|---|---|---|---|
| beta[1] | 59.63543 | 55.80959 | 37.46380 | 89.79846 |
| beta[2] | 0.03976 | 0.04138 | 0.02815 | 0.04903 |
| l | 0.09253 | 0.09374 | 0.07121 | 0.11468 |
| sigma2 | 48.76159 | 48.84577 | 39.31196 | 56.76181 |
| sigma2_w | 2.96200 | 2.92310 | 2.34422 | 3.71781 |
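Plugging in the posterior medians, the prediction code that follows draws from the standard conditional multivariate normal, with $o$ indexing observed and $p$ indexing prediction locations:

\[ \begin{aligned} \mu_{p \mid o} &= \mu_p + \Sigma_{po} \Sigma_{o}^{-1} (y - \mu_o) \\ \Sigma_{p \mid o} &= \Sigma_p - \Sigma_{po} \Sigma_{o}^{-1} \Sigma_{op} \end{aligned} \]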
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(apple_gp_resid = apple_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$apple_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(nrow(subset), mean = 0, sd = 0.01) # small jitter so prediction dates don't coincide exactly with observed dates
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o = sq_exp_cov(dist_o, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p = sq_exp_cov(dist_p, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
apple_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
apple_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
geom_line(aes(y=apple_close)) +
geom_ribbon(data=apple_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
geom_line(data=apple_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) +
ylab("Apple Daily Close Stock Price")
# Use the posterior mean as the Bayesian prediction to compute RMSE
rmse(subset$apple_close, apple_pred_df_bayes$post_mean)
fb_emp_cloud = subset %>% emp_semivariogram(fb_naive_residual,Date)
fb_emp = rbind(
subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=1) %>% mutate(binwidth="binwidth=1"),
subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=15) %>% mutate(binwidth="binwidth=15"),
subset %>% emp_semivariogram(fb_naive_residual, Date, bin=TRUE, binwidth=30) %>% mutate(binwidth="binwidth=30")
)
fb_emp %>%
ggplot(aes(x=h, y=gamma)) +
geom_point(size = 1) +
ggtitle("Empirical Semivariogram of Facebook (binned)")+
facet_wrap(~binwidth, nrow=2)
fb_gp_exp_model = "model{
y ~ dmnorm(mu, inverse(Sigma))
for (i in 1:N) {
mu[i] <- beta[1]+ beta[2] * x[i]
}
for (i in 1:(N-1)) {
for (j in (i+1):N) {
Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
Sigma[j,i] <- Sigma[i,j]
}
}
for (k in 1:N) {
Sigma[k,k] <- sigma2 + sigma2_w
}
for (i in 1:2) {
beta[i] ~ dt(coef[i], 2.5, 1)
}
sigma2_w ~ dnorm(10, 1/25) T(0,)
sigma2 ~ dnorm(390, 1/200) T(0,)
l ~ dt(0,2.5,1) T(0,)
}"
if (file.exists("fb_gp_jags.Rdata")) {
load(file="fb_gp_jags.Rdata")
} else {
m = rjags::jags.model(
textConnection(fb_gp_exp_model),
data = list(
y = subset$fb_close,
x = subset$Open,
d = dist(subset$Date) %>% as.matrix(),
N = nrow(subset),
coef = coef(fb_lm)
),
quiet = TRUE
)
update(m, n.iter=2000)
exp_cov_coda = rjags::coda.samples(
m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
n.iter=2000, thin=10
)
save(exp_cov_coda, file="fb_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
ungroup() %>%
mutate(.variable = paste0(.variable, "[",i,"]")) %>%
select(-i)
betas %>%
group_by(.variable) %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.iteration, y=.value, color=.variable)) +
geom_line() +
facet_grid(.variable~., scales = "free_y")
params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.iteration, y=.value, color=.variable)) +
geom_line() +
facet_grid(.variable~., scales="free_y")
params %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.value, fill=.variable)) +
geom_density() +
facet_wrap(~.variable, scales="free") +
guides(fill=FALSE)
params %>%
slice(seq(1,n(),length.out=500)) %>%
filter(.variable == "l") %>%
ggplot(aes(x=.value, fill=.variable)) +
geom_density() +
scale_x_log10() +
facet_wrap(~.variable, scales="free") +
guides(fill=FALSE)
post = bind_rows(betas, params) %>%
group_by(.variable) %>%
summarize(
post_mean = mean(.value),
post_med = median(.value),
post_lower = quantile(.value, probs = 0.025),
post_upper = quantile(.value, probs = 0.975)
)
knitr::kable(post, digits = 5)
| .variable | post_mean | post_med | post_lower | post_upper |
|---|---|---|---|---|
| beta[1] | 138.00653 | 137.23548 | 92.06659 | 190.34675 |
| beta[2] | 0.01138 | 0.01184 | -0.00740 | 0.03109 |
| l | 0.06331 | 0.06302 | 0.05771 | 0.07039 |
| sigma2 | 385.32748 | 384.58300 | 357.71997 | 417.85007 |
| sigma2_w | 4.45172 | 4.45002 | 3.73450 | 5.41662 |
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(fb_gp_resid = fb_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$fb_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(nrow(subset), mean = 0, sd = 0.01) # small jitter so prediction dates don't coincide exactly with observed dates
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o = sq_exp_cov(dist_o, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p = sq_exp_cov(dist_p, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
fb_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
fb_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
geom_line(aes(y=fb_close)) +
geom_ribbon(data=fb_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
geom_line(data=fb_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) +
ylab("Facebook Daily Close Stock Price")
# Use the posterior mean as the Bayesian prediction to compute RMSE
rmse(subset$fb_close,fb_pred_df_bayes$post_mean)
## [1] 2.027406
google_emp_cloud = subset %>% emp_semivariogram(google_naive_residual,Date)
google_emp = rbind(
subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=1) %>% mutate(binwidth="binwidth=1"),
subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=5) %>% mutate(binwidth="binwidth=5"),
subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=10) %>% mutate(binwidth="binwidth=10"),
subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=15) %>% mutate(binwidth="binwidth=15"),
subset %>% emp_semivariogram(google_naive_residual, Date, bin=TRUE, binwidth=30) %>% mutate(binwidth="binwidth=30")
)
google_emp %>%
ggplot(aes(x=h, y=gamma)) +
geom_point(size = 1) +
ggtitle("Empirical Semivariogram of Google (binned)")+
facet_wrap(~binwidth, nrow=2)
google_gp_exp_model = "model{
y ~ dmnorm(mu, inverse(Sigma))
for (i in 1:N) {
mu[i] <- beta[1]+ beta[2] * x[i]
}
for (i in 1:(N-1)) {
for (j in (i+1):N) {
Sigma[i,j] <- sigma2 * exp(- pow(l*d[i,j],2))
Sigma[j,i] <- Sigma[i,j]
}
}
for (k in 1:N) {
Sigma[k,k] <- sigma2 + sigma2_w
}
for (i in 1:2) {
beta[i] ~ dt(coef[i], 2.5, 1)
}
sigma2_w ~ dnorm(100, 1/100) T(0,)
sigma2 ~ dnorm(700, 1/400) T(0,)
l ~ dt(0,2.5,1) T(0,)
}"
if (file.exists("google_gp_jags.Rdata")) {
load(file="google_gp_jags.Rdata")
} else {
m = rjags::jags.model(
textConnection(google_gp_exp_model),
data = list(
y = subset$google_close,
x = subset$Open,
d = dist(subset$Date) %>% as.matrix(),
N = nrow(subset),
coef = coef(google_lm)
),
quiet = TRUE
)
update(m, n.iter=2000)
exp_cov_coda = rjags::coda.samples(
m, variable.names=c("beta", "sigma2", "l", "sigma2_w"),
n.iter=2000, thin=10
)
save(exp_cov_coda, file="google_gp_jags.Rdata")
}
betas = tidybayes::gather_draws(exp_cov_coda, beta[i]) %>%
ungroup() %>%
mutate(.variable = paste0(.variable, "[",i,"]")) %>%
select(-i)
betas %>%
group_by(.variable) %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.iteration, y=.value, color=.variable)) +
geom_line() +
facet_grid(.variable~., scales = "free_y")
params = tidybayes::gather_draws(exp_cov_coda, sigma2, l, sigma2_w)
params %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.iteration, y=.value, color=.variable)) +
geom_line() +
facet_grid(.variable~., scales="free_y")
params %>%
slice(seq(1,n(),length.out=500)) %>%
ggplot(aes(x=.value, fill=.variable)) +
geom_density() +
facet_wrap(~.variable, scales="free") +
guides(fill=FALSE)
params %>%
slice(seq(1,n(),length.out=500)) %>%
filter(.variable == "l") %>%
ggplot(aes(x=.value, fill=.variable)) +
geom_density() +
scale_x_log10() +
facet_wrap(~.variable, scales="free") +
guides(fill=FALSE)
post = bind_rows(betas, params) %>%
group_by(.variable) %>%
summarize(
post_mean = mean(.value),
post_med = median(.value),
post_lower = quantile(.value, probs = 0.025),
post_upper = quantile(.value, probs = 0.975)
)
knitr::kable(post, digits = 5)
| .variable | post_mean | post_med | post_lower | post_upper |
|---|---|---|---|---|
| beta[1] | -498.77787 | -500.77321 | -505.52283 | -468.68757 |
| beta[2] | 0.58398 | 0.58450 | 0.57281 | 0.58877 |
| l | 0.09541 | 0.09349 | 0.07905 | 0.12334 |
| sigma2 | 697.70238 | 697.63588 | 663.18587 | 732.68595 |
| sigma2_w | 126.16484 | 126.77453 | 113.52041 | 139.03941 |
l = post %>% filter(.variable == 'l') %>% pull(post_med)
sigma2 = post %>% filter(.variable == 'sigma2') %>% pull(post_med)
sigma2_w = post %>% filter(.variable == 'sigma2_w') %>% pull(post_med)
beta0 = post %>% filter(.variable == 'beta[1]') %>% pull(post_med)
beta1 = post %>% filter(.variable == 'beta[2]') %>% pull(post_med)
df = df %>% mutate(google_gp_resid = google_close - beta0 - beta1 * Open)
reps=1000
x = subset$Open
y = subset$google_close
d = subset$Date
x_pred = subset$Open
d_pred = subset$Date + rnorm(nrow(subset), mean = 0, sd = 0.01) # small jitter so prediction dates don't coincide exactly with observed dates
mu = beta0 + beta1*x
mu_pred = beta0 + beta1*x_pred
dist_o = fields::rdist(d)
dist_p = fields::rdist(d_pred)
dist_op = fields::rdist(d, d_pred)
dist_po = t(dist_op)
cov_o = sq_exp_cov(dist_o, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_p = sq_exp_cov(dist_p, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_op = sq_exp_cov(dist_op, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cov_po = sq_exp_cov(dist_po, sigma2 = sigma2, l = l, sigma2_w = sigma2_w)
cond_cov = cov_p - cov_po %*% solve(cov_o) %*% cov_op
cond_mu = mu_pred + cov_po %*% solve(cov_o) %*% (y - mu)
pred_bayes = cond_mu %*% matrix(1, ncol=reps) + t(chol(cond_cov)) %*% matrix(rnorm(length(x_pred)*reps), ncol=reps)
google_pred_df_bayes = pred_bayes %>% t() %>% post_summary() %>% mutate(x=x_pred)
google_pred_df_bayes$Date = subset$Date
ggplot(subset, aes(x=Date)) +
geom_line(aes(y=google_close)) +
geom_ribbon(data=google_pred_df_bayes, aes(ymin=post_lower,ymax=post_upper, x=Date), fill="red", alpha=0.5) +
geom_line(data=google_pred_df_bayes, aes(y=post_mean), color='blue', size=0.5) +
ylab("Google Daily Close Stock Price")
# Use the posterior mean as the Bayesian prediction to compute RMSE
rmse(subset$google_close,google_pred_df_bayes$post_mean)
## [1] 12.0692